

Project Team Grand-large


Section: New Results

Communication avoiding algorithms for linear algebra

Participants : Laura Grigori, Simplice Donfack, Amal Khabou, Mathias Jacquelin, Sophie Moufawad.

The focus of this research is the design of efficient parallel algorithms for solving problems in numerical linear algebra, such as very large sets of linear equations and large least squares problems, often with millions of rows and columns. These problems arise in many numerical simulations, and solving them is very time consuming.

This research focuses on developing new algorithms for linear algebra problems that minimize the required communication, in terms of both latency and bandwidth. In 2008 we introduced two communication avoiding algorithms for computing the LU and QR factorizations, referred to as CALU and CAQR (joint work with J. Demmel and M. Hoemmen from U.C. Berkeley, J. Langou from C.U. Denver, and H. Xiang, then at INRIA) [6], [9]. Since then, we have also designed a communication avoiding algorithm for rank revealing QR. In addition, we have extended theoretical lower bounds to sparse Cholesky factorization and identified algorithms that attain these bounds and hence minimize communication. Communication avoiding algorithms are now studied by several other groups, including groups at INRIA, and they are starting to be implemented and made available in public libraries such as ScaLAPACK.

During 2011, our research focused on the stability of the communication avoiding LU factorization and on its implementation on multicore machines. In [20] we focus on the numerical properties of CALU. To decrease the communication required by the LU factorization, CALU uses a new pivoting strategy, referred to as tournament pivoting, that may lead to a different row permutation than classic LU factorization with partial pivoting. We have further investigated the numerical stability of CALU. The reason to consider CALU is that it performs an optimal amount of communication, asymptotically less than Gaussian elimination with partial pivoting (GEPP), and so will be much faster on platforms where communication is expensive, as shown in previous work. We show that the Schur complement obtained after each step of performing CALU on a matrix A is the same as the Schur complement obtained after performing GEPP on a larger matrix whose entries are the same as the entries of A (sometimes slightly perturbed) and zeros. More generally, the entire CALU process is equivalent to GEPP on a large, but very sparse, matrix formed by entries of A and zeros. Hence we expect that CALU will behave like GEPP and will also be very stable in practice. Indeed, extensive experiments on random matrices and a set of special matrices show that CALU is stable in practice. The upper bound on the growth factor of CALU is worse than that of GEPP. However, there are Wilkinson-like matrices for which GEPP has an exponential growth factor but CALU does not, and vice versa.
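To illustrate the idea behind tournament pivoting, the sketch below selects pivot rows for a panel through a reduction tree: each block of rows proposes candidate pivot rows via GEPP, and candidates from pairs of blocks are repeatedly merged until b winners remain. This is a simplified numpy/scipy illustration of the selection mechanism only, not the actual CALU implementation; the function names and the block count are our own choices.

```python
import numpy as np
from scipy.linalg import lu

def best_rows(block, b):
    # GEPP on the block: with scipy's convention block = p @ l @ u,
    # the first rows of p.T @ block are the pivot rows, in pivot order.
    p, _, _ = lu(block)
    return (p.T @ block)[:b]

def tournament_pivot(panel, b, nblocks=4):
    # Split the panel's rows into nblocks chunks, then run a reduction
    # tree: each round keeps the b locally "best" rows of each pair of
    # candidate sets, until a single set of b pivot rows remains.
    chunks = np.array_split(panel, nblocks, axis=0)
    cands = [best_rows(c, min(b, c.shape[0])) for c in chunks]
    while len(cands) > 1:
        merged = []
        for i in range(0, len(cands), 2):
            pair = np.vstack(cands[i:i + 2])
            merged.append(best_rows(pair, min(b, pair.shape[0])))
        cands = merged
    return cands[0]  # b rows used as pivots for the whole panel
```

Because each reduction step only selects rows and never transforms them, every returned row is an actual row of the original panel; this is what allows the process to be analyzed as GEPP on a larger, sparser matrix.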

We present experimental results for random matrices and for a set of special matrices, including sparse matrices, for binary-tree-based and flat-tree-based CALU. We discuss the stability of both the LU factorization and the linear solver, in terms of pivot growth and backward errors. The results show that in practice CALU is stable. We present the backward errors measured three ways: by ||PA - LU|| / ||A||, by the normwise backward error ||Ax - b|| / (||A|| ||x|| + ||b||), and by the componentwise backward error (after iterative refinement in working precision). For random matrices, all of CALU's backward errors were at most 1.9x larger than GEPP's backward errors. We also tested "special" matrices, including known difficult examples: (1) The ratios of ||PA - LU|| / ||A|| were at most 1 in over 69% of cases (i.e. CALU was at least as stable as GEPP), and always 1.5 or smaller, except for one ratio of 4.3, in which case both backward errors were much smaller than 2^-53 = machine epsilon. (2) The ratios of normwise backward errors were at most 1 in over 53% of cases, and always 1.5 or smaller, except for 5 ratios ranging up to 26, in which cases all backward errors were less than 4x machine epsilon. (3) The ratios of componentwise backward errors were at most 1 in over 52% of cases, and always 3.2 or smaller, except for one ratio of 8.3.
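The first two error measures above can be computed in a few lines. The text does not specify which norms were used in the experiments, so the sketch below uses numpy's defaults (Frobenius norm for matrices, 2-norm for vectors) purely for illustration; the function names are our own.

```python
import numpy as np

def factorization_error(A, P, L, U):
    # ||PA - LU|| / ||A||: how well the computed factors reproduce A.
    return np.linalg.norm(P @ A - L @ U) / np.linalg.norm(A)

def normwise_backward_error(A, x, b):
    # ||Ax - b|| / (||A|| ||x|| + ||b||): the standard normwise residual
    # test; values near machine epsilon indicate a backward stable solve.
    r = A @ x - b
    return np.linalg.norm(r) / (
        np.linalg.norm(A) * np.linalg.norm(x) + np.linalg.norm(b))
```

For a stable factorization or solve in double precision, both quantities should come out within a modest multiple of 2^-53; the ratios reported above compare these values for CALU against those for GEPP on the same matrix.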

In [30] we design a scheduling algorithm for efficiently executing CALU on multicore architectures. We focus on a tunable scheduling strategy that maintains load balance across cores while also maintaining data locality and low dequeue overhead. To achieve this, we use a strategy that combines static and dynamic scheduling; this approach was shown to be successful on regular mesh computations by V. Kale and B. Gropp. The tunable strategy lets us flexibly control the percentage of tasks that are scheduled dynamically; this provides a knob to control load balancing so that it occurs only at the point in the computation where the benefits it provides outweigh the costs it induces. On NUMA machines, where remote memory access is costly, the percentage of work scheduled dynamically should be small enough to avoid excessive cache misses, but large enough to keep the cores busy during idle times in the static part.
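The hybrid mechanism can be sketched as follows: a fraction of the tasks is pre-assigned to workers round-robin (static phase, good locality, no queue contention), and the remainder sits in a shared queue that idle workers drain at runtime (dynamic phase, load balancing). This is a hypothetical toy in Python threads to show the control structure only; the real implementation targets native cores, and all names here are our own.

```python
import queue
import threading

def hybrid_schedule(tasks, nworkers, dynamic_fraction):
    # Split tasks: the first (1 - f) fraction is pre-assigned round-robin
    # (static); the last f fraction goes into a shared queue (dynamic).
    nstatic = int(len(tasks) * (1.0 - dynamic_fraction))
    static_part, dynamic_part = tasks[:nstatic], tasks[nstatic:]
    shared = queue.Queue()
    for t in dynamic_part:
        shared.put(t)
    results, lock = [], threading.Lock()

    def worker(wid):
        # Static phase: run pre-assigned tasks without touching the queue.
        for t in static_part[wid::nworkers]:
            r = t()
            with lock:
                results.append(r)
        # Dynamic phase: claim remaining tasks until the queue is empty.
        while True:
            try:
                t = shared.get_nowait()
            except queue.Empty:
                return
            r = t()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(nworkers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Here dynamic_fraction is the tunable knob: 0.0 recovers fully static scheduling and 1.0 fully dynamic, matching the two baselines the paper compares against.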

In this work, we show the effectiveness of this method in the context of already highly optimized dense matrix factorizations. Our prior work on multi-threaded CALU was based on dynamic scheduling; the algorithm performed well on tall and skinny matrices, but became less scalable on square matrices as the number of processors increased. We show that using this hybrid scheduling in communication avoiding dense factorizations leads to significant performance gains. On a 48-core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than the versions of CALU that use fully static or fully dynamic scheduling. Our algorithm also leads to speedups over the corresponding LU factorization routines in well-known libraries: on the 48-core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16-core Intel Xeon machine it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both systems.